IDa-Det: An Information Discrepancy-Aware Distillation for 1-bit Detectors
[Figure 6.16 panels: proposal pairs $(R^t_1, R^s_1)$, $(R^t_2, R^s_2)$, $(R^t_3, R^s_3)$, $(R^t_4, R^s_4)$. Legend: $R^s$ in student; paired $R^s$ in student; $R^t$ in teacher; paired $R^t$ in teacher.]
FIGURE 6.16
Illustration of the generation of the proposal pairs. Each proposal in one model
generates a counterpart feature-map patch at the same location in the other model.
channel-wise proposal feature and measure the discrepancy as
$$
\varepsilon_n = \sum_{c=1}^{C} \left\| (R^t_{n;c} - R^s_{n;c})^{T} \, \Sigma_{n;c}^{-1} \, (R^t_{n;c} - R^s_{n;c}) \right\|_2, \tag{6.83}
$$
where $\Sigma_{n;c}$ denotes the covariance matrix of the teacher and the student in the $c$-th channel
of the $n$-th proposal pair. The Mahalanobis distance accounts for both the pixel-level
distance between proposals and the differences in statistical characteristics within each pair of
proposals.
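
The discrepancy in Eq. (6.83) can be sketched in NumPy as follows. This is a minimal illustration under several assumptions not stated in the text: each channel of a proposal is treated as an $H \times W$ matrix, $\Sigma_{n;c}$ is estimated as the $H \times H$ covariance of the teacher and student columns taken together, a small ridge term keeps it invertible, and the matrix norm is taken to be the Frobenius norm. The function names and the `ridge` parameter are hypothetical.

```python
import numpy as np

def channel_discrepancy(rt_c, rs_c, ridge=1e-3):
    """Mahalanobis-style discrepancy for one channel of one proposal pair.

    rt_c, rs_c: (H, W) teacher and student feature patches.
    """
    diff = rt_c - rs_c                                  # (H, W)
    # Assumption: covariance over the H rows, estimated from the 2W columns
    # of the teacher and student patches taken together.
    samples = np.concatenate([rt_c, rs_c], axis=1)      # (H, 2W)
    cov = np.cov(samples)                               # (H, H)
    cov = cov + ridge * np.eye(cov.shape[0])            # assumed regulariser
    # diff^T Sigma^{-1} diff, computed without forming the explicit inverse.
    quad = diff.T @ np.linalg.solve(cov, diff)          # (W, W)
    return np.linalg.norm(quad)                         # Frobenius norm (assumed)

def proposal_discrepancy(rt, rs, ridge=1e-3):
    """epsilon_n: sum of channel-wise discrepancies; rt, rs are (C, H, W)."""
    return sum(channel_discrepancy(t, s, ridge) for t, s in zip(rt, rs))

rng = np.random.default_rng(0)
rt = rng.normal(size=(3, 7, 7))   # toy teacher proposal, C=3, H=W=7
rs = rng.normal(size=(3, 7, 7))   # toy student proposal
eps_n = proposal_discrepancy(rt, rs)
```

With identical teacher and student patches the difference vanishes and the discrepancy is zero; any other covariance estimate or norm could be substituted in `channel_discrepancy` without changing the overall structure.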
To select representative proposals with maximum information discrepancy, we first define
a binary distillation mask $m_n$ as
$$
m_n = \begin{cases} 1, & \text{if pair } (R^t_n, R^s_n) \text{ is selected}, \\ 0, & \text{otherwise}, \end{cases} \tag{6.84}
$$
where $m_n = 1$ denotes that distillation is applied to this proposal pair; otherwise,
the pair remains unchanged. Distillation is needed only when the distributions within
a pair differ substantially, since only then can the student model learn from its
teacher counterpart.
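
As a rough illustration of how such a mask might be built, the sketch below marks the pairs with the largest discrepancies $\varepsilon_n$. It assumes that a fixed fraction of the pairs is kept; the function name `select_mask` and the parameter `ratio` are hypothetical and used only for this example.

```python
import numpy as np

def select_mask(eps, ratio):
    """Binary mask marking the pairs with the largest discrepancies."""
    eps = np.asarray(eps, dtype=float)
    k = int(round(ratio * eps.size))          # number of pairs to distill
    mask = np.zeros(eps.size, dtype=int)
    if k > 0:
        top = np.argpartition(eps, -k)[-k:]   # indices of the k largest values
        mask[top] = 1
    return mask

mask = select_mask([0.2, 1.5, 0.7, 3.1], ratio=0.5)
print(mask)  # [0 1 0 1]
```

Here the two pairs with the largest discrepancies (1.5 and 3.1) receive $m_n = 1$, and the remaining pairs are left untouched.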
Based on the derivation above, discrepant proposal pairs are optimized through
distillation. To distill the selected pairs, we maximize the conditional probability
$p(R^s_n \mid R^t_n)$, so that after distillation the feature distributions of the teacher
proposals and their student counterparts become similar. To this end, we define
$p(R^s_n \mid R^t_n)$, taking $m_n$, $n \in \{1, \cdots, N_T + N_S\}$, into account, as
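$$
p(R^s_n \mid R^t_n; m_n) \sim m_n \, \mathcal{N}\!\left(\mu^t_n, {\sigma^t_n}^2\right) + (1 - m_n) \, \mathcal{N}\!\left(\mu^s_n, {\sigma^s_n}^2\right). \tag{6.85}
$$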
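
Because $m_n$ is binary, the right-hand side of Eq. (6.85) reduces to a single Gaussian for each pair: the teacher's when $m_n = 1$ and the student's own when $m_n = 0$. The sketch below makes this explicit. It assumes that $\mu$ and $\sigma^2$ are scalar per-pair feature statistics, which the text does not spell out, and the function names are hypothetical.

```python
import numpy as np

def target_gaussian(mask, mu_t, var_t, mu_s, var_s):
    """Per-pair Gaussian parameters selected by the binary mask.

    Blending the parameters equals blending the densities only because
    every mask entry is exactly 0 or 1.
    """
    mask = np.asarray(mask, dtype=float)
    mu = mask * np.asarray(mu_t) + (1.0 - mask) * np.asarray(mu_s)
    var = mask * np.asarray(var_t) + (1.0 - mask) * np.asarray(var_s)
    return mu, var

def gaussian_log_density(x, mu, var):
    """Log-density of a univariate Gaussian, evaluated elementwise."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)

# Two pairs: the first is selected for distillation, the second is not.
mu, var = target_gaussian([1, 0],
                          mu_t=[1.0, 2.0], var_t=[1.0, 1.0],
                          mu_s=[3.0, 4.0], var_s=[2.0, 2.0])
```

In this toy case the first pair takes the teacher statistics (mean 1.0, variance 1.0) and the second keeps the student statistics (mean 4.0, variance 2.0).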
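Subsequently, we introduce a bilevel optimization formulation to solve the distillation
problem as
$$
\begin{aligned}
&\max_{R^s_n} \; p(R^s_n \mid R^t_n; m^*), \quad \forall n \in \{1, \cdots, N_T + N_S\}, \\
&\text{s.t.} \;\; m^* = \arg\max_{m} \sum_{n=1}^{N_T + N_S} m_n \varepsilon_n,
\end{aligned}
\tag{6.86}
$$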
where $m = [m_1, \cdots, m_{N_T + N_S}]$ and $\|m\|_0 = \gamma \cdot (N_T + N_S)$, with $\gamma$ a hyperparameter. In
this way, we select $\gamma \cdot (N_T + N_S)$ pairs of proposals that contain the most representative